Abstract



One conjecture shared by deep learning and the classical connectionist viewpoint is that the
biological brain implements certain kinds of deep networks as its back-end. However, to our
knowledge, a detailed correspondence has not yet been established, which is important if we want
to bridge neuroscience and machine learning. Recent research has emphasized the biological
plausibility of the Linear-Nonlinear-Poisson (LNP) neuron model. We show that, with neurally
plausible choices of parameters, the whole neural network is capable of representing any
Boltzmann machine and of performing a semi-stochastic Bayesian inference algorithm lying between
Gibbs sampling and variational inference.



1 Introduction 
The classical connectionist viewpoint has long been inspired by how the brain works, as in the
invention of the "perceptron" [1] and the "parallel distributed processing" approach [2]. Modern
deep learning and unsupervised feature learning also assume connections between the biological
brain and certain kinds of deep networks, either probabilistic or not. For example, properties of
visual area V2 are found to be comparable to those of sparse autoencoder networks [3]; the sparse
coding learning algorithm [4] originated directly from neuroscience observations; and
psychological phenomena such as end-stopping are observed in sparse coding experiments [5].

In addition, there are architectural similarities between neuroscience and machine learning. For
example, neuronal spike propagation may correspond to prediction or inference tasks in a learned
model; synaptic plasticity may correspond to parameter estimation given the network structure;
neuroplasticity may correspond to learning the network structure together with inventing new
hidden units in the network; axonal path finding with the help of guidance signals may be regarded
as a structural prior that makes learning the network structure easier.

For this reason, it may again be helpful to refer to how the brain works when dealing with problems
that currently puzzle machine learning researchers. For example, training deep networks was made
practical mostly after the breakthrough of deep learning starting from 2006 (e.g. [6][7][8][9][10]).
The success may lie in a safe way of adapting the network structure, such that the parameter
estimation part can be effective enough to reach non-trivial local optima. However, since the brain
is not exactly layer-wise, we may wonder whether there is any other way of choosing the network
structure along with parameter estimation. If we want to transfer neuroscience knowledge to answer
such questions and build a bridge between these two areas, we first need to figure out what exactly
the deep network is that the brain represents.

A good starting point for answering this may be to look at single neuron computation. From how
neurons transfer information among themselves, we can reverse engineer both the representation and
the prediction / inference algorithm, which is the main purpose of this paper.

In theoretical neuroscience, single neuron models are divided into different levels of detail (see
[11] for a five-level survey). Detailed models can be quite realistic and rely on few simplifying
assumptions. Still, effective simplifications hold to different extents. For example,
integrate-and-fire models capture essential properties of the Hodgkin-Huxley model [12], and its
variant, the exponential integrate-and-fire neuron model [13], is found to be capable of
reproducing the spike timing of several types of cortical neurons. The community has also long been
investigating Linear-Nonlinear-Poisson (LNP) models [14][15][16]. A recent study [17] analyzed how
LNP models effectively reproduce firing rates of more realistic neurons. Based on these results, we
start by formally presenting the LNP model and then turn to presenting a semi-stochastic inference
algorithm on Boltzmann machines, which is derived by combining Gibbs sampling and variational
inference. We then make a detailed matching between the LNP model and the inference algorithm.
Since the semi-stochastic inference algorithm has not been explored in the learning community, we
also show some experiments illustrating its computational properties, including its stochastic
convergence and its similarity to variational inference.



2 Brief Review on Neural Plausibility of LNP Model 

In this section, we briefly review the Linear-Nonlinear-Poisson neuron model and its neural plausi- 
bility, focusing on what modeling options we have to fit it with useful learning models. 

The LNP neuron model [14][15][16] formalizes the spike train generated by each neuron as a
nonhomogeneous Poisson point process, whose rate function over time depends on the input spike
trains from its presynaptic neurons, with a certain form of short-term memory in the dependence.

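$$ I_{ij}(\tau) = W_{ij} \cdot (X_j * \alpha)(\tau) \tag{1} $$

$$ I_i(\tau) = \text{Multi-Linear}\left(\{I_{ij}(\tau)\}_{j \in \mathcal{M}(i)}\right) \tag{2} $$

$$ \lambda_i(\tau) = \sigma\left((I_i * D)(\tau)\right) \tag{3} $$

$$ X_i(\tau) \sim \text{PoissonPointProcess}\left(\lambda_i(\tau)\right) \tag{4} $$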

In all these equations, $\tau$ is the continuous time index, whose unit is the millisecond.
Subscript $i$ is the index of the neuron in question, and $\mathcal{M}(i)$ is the index set of
presynaptic neurons of neuron $i$. In Eq. (4), $X_i(\tau) = \sum_f \delta(\tau - t_i^{(f)})$
represents the spike train, with spikes as Dirac functions located at times $t_i^{(f)}$, and
$\lambda_i(\tau)$ is the rate function of the Poisson point process. Eq. (1) represents the
postsynaptic current at a particular synapse with efficacy $W_{ij}$; $\alpha$ is a non-negative
function with a certain time course. Evidence for "Poisson-like" statistics in cortical neurons can
be found in [18][19], which implies that spike counts over a certain interval may be Poisson
distributed.

Eq. (2) is the dendritic summation. There are different viewpoints on whether linearity holds.
Due to phenomena like mutual inhibition on different dendritic shafts, it is commonly known that
nonlinear summation can happen (see [20][21] for example), yet studies in [22] show that when
dendritic spines are present, linear summation holds, and this happens frequently in excitatory
mammalian cortex. In addition, a survey on neuron models [11] summarized different computations
that can be performed in dendritic processing; two notable examples are multiplication and
summation. We include these two and generalize them into the "Multi-Linear" function in Eq. (2),
i.e., a function that is linear with respect to each of its arguments individually. Here we use the
set notation $\{\cdot\}_{j \in \mathcal{M}(i)}$ to denote the set of arguments. What is special
about a "Multi-Linear" function is that, when taking its expectation with respect to a certain
argument, the expectation goes inside and the resulting formula simply replaces the argument by its
mean. This is useful for mean-field algorithms.
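The mean-replacement property of multi-linear functions can be checked numerically. The sketch below is only an illustration: the particular function `multilinear` and the probabilities are hypothetical choices, and the arguments are assumed to be independent Bernoulli variables so that the expectation can be taken over all of them at once.

```python
import itertools


def multilinear(args):
    """A hypothetical multi-linear function: linear in each argument separately."""
    a, b, c = args
    return 2.0 * a + 3.0 * a * b - 1.5 * b * c + 0.5 * a * b * c


def exact_expectation(f, probs):
    """Exact E[f(X)] for independent Bernoulli X_k with P(X_k = 1) = probs[k]."""
    total = 0.0
    for bits in itertools.product((0, 1), repeat=len(probs)):
        weight = 1.0
        for x, p in zip(bits, probs):
            weight *= p if x == 1 else 1.0 - p
        total += weight * f(bits)
    return total


probs = [0.2, 0.7, 0.5]
exact = exact_expectation(multilinear, probs)  # expectation by enumeration
plug_in = multilinear(probs)                   # arguments replaced by their means
print(exact, plug_in)
```

The two printed numbers agree up to rounding, which is the property a mean-field algorithm relies on.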

After dendritic summation, the relationship between the summed current and the firing rate can be
well studied in in-vitro neuroscience experiments where the input current is externally controlled.
Various frequency-current functions can also be derived from popular spike generation models, such
as the Leaky Integrate-and-Fire model (see [12] for an overview), the Exponential Integrate-and-Fire
model [13], the Spike Response Model [23], or more realistic models such as the Hodgkin-Huxley
model [24]. In particular, [14][15][16] formulated the LNP simplification as a nonlinear mapping
followed by Poisson point process random sampling. The analytical study with simulation in [17]
shows that the simplification is effective. We adopt this idea in Eq. (3).
Since neural spikes have a refractory period of at least 1 ms, the firing rate will not exceed
1000 Hz and there will not be more than one spike in 1 ms, so $\lambda_i(\tau) \in [0, 1]$. In
practice, the firing rate is often even lower. In addition, discrete time steps at a resolution of
1 ms per step seem quite fine-grained for most perceptual tasks (e.g., visual recognition emerges
in 75-80 ms [25]). For these reasons, discretizing time into 1 ms steps may be reasonable. By
Le Cam's theorem [26], the discrete counterpart of the nonhomogeneous Poisson point process is the
nonhomogeneous Bernoulli process, and the Bernoulli probability at each discrete time step is
approximated by the integral of the rate function over that time period. We use $t \in \mathbb{Z}$
to denote the discrete time steps, whose unit is the millisecond. We use $\phi_i^{(t)} \in [0, 1]$
to denote the Bernoulli probability and $x_i^{(t)} \in \{0, 1\}$ to denote the spike indicator at
time $t$.

$$ x_i^{(t)} \sim \mathrm{Bernoulli}\left(\phi_i^{(t)}\right), \qquad \phi_i^{(t)} = \int_{t}^{t+1} \lambda_i(\tau)\, d\tau \tag{5} $$

Figure 1: Frequency-current curve of leaky integrate-and-fire neurons for a particular set of
parameters (horizontal axis: input current I). More from different neuron models can be found
in [?].
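To make the Bernoulli approximation concrete, the following sketch compares, for a single 1 ms bin, the Poisson spike-count distribution with mean $\phi$ against a Bernoulli($\phi$) spike indicator. The function names and the example values of $\phi$ are illustrative assumptions. For a single bin, Le Cam's inequality bounds the summed absolute difference of the two distributions by $2\phi^2$.

```python
import math


def poisson_bin_probs(phi):
    """P(0 spikes), P(1 spike), P(2 or more spikes) for a Poisson count with mean phi."""
    p0 = math.exp(-phi)
    p1 = phi * math.exp(-phi)
    return p0, p1, 1.0 - p0 - p1


def bernoulli_gap(phi):
    """Sum over spike counts of |Poisson(phi) pmf - Bernoulli(phi) pmf|."""
    p0, p1, p2_plus = poisson_bin_probs(phi)
    # Bernoulli(phi) puts mass 1 - phi on 0 spikes, phi on 1 spike, 0 elsewhere.
    return abs(p0 - (1.0 - phi)) + abs(p1 - phi) + p2_plus


for phi in (0.001, 0.01, 0.05, 0.5):
    print(phi, bernoulli_gap(phi), 2.0 * phi ** 2)
```

At firing rates well below 1000 Hz the integrated rate per bin is small, so the gap shrinks roughly quadratically with $\phi$.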

The discrete forms of Eqs. (1), (2) and (3) are obtained by substituting $X_i(\tau)$ with
$x_i^{(t)}$ and substituting the $\alpha$ and $D$ functions with their discrete forms, similar to
the definition of $\phi_i^{(t)}$ in Eq. (5). There are two ways to proceed with the formulation. If
the $D$ function is the Dirac function located at 0, or is close to it, the convolution with $D$
can be ignored, and the discrete counterpart of Eqs. (1), (2) and (3) is

$$ I_{ij}^{(t)} = W_{ij} \cdot \sum_{k=1}^{K} \alpha(k) \cdot x_j^{(t-k)}, \qquad \phi_i^{(t)} = \sigma\left(\text{Multi-Linear}\left(\{I_{ij}^{(t)}\}_{j \in \mathcal{M}(i)}\right)\right) \tag{6} $$

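Or, if $D$ is non-trivial but the Multi-Linear function in Eq. (2) reduces to a linear summation,
by the associativity of convolution we can let $\epsilon = \alpha * D$, and the discrete
counterpart of Eqs. (1), (2) and (3) reduces to the following, assuming a finite time course of the
function $\epsilon$.

$$ \phi_i^{(t)} = \sigma\Big(\sum_{j \in \mathcal{M}(i)} W_{ij} \cdot \sum_{k=1}^{K} \epsilon(k) \cdot x_j^{(t-k)}\Big) \tag{7} $$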

In [17], the $D$ functions resulting from spiking neuron models are not Dirac functions, but are
very close. In the rest of this paper, we focus on the case when linear summation holds (hence
Eq. (7) holds), although the multi-linearity in Eq. (6) may generalize our formulation to
high-order Boltzmann machines. Another observation is that in Eq. (7), if $D$ is an exponential
function (true in [17]) and $\alpha$ is an exponential function too (true in [12]), and assuming
both functions have the same scale parameter, $\epsilon$ becomes the classic $\alpha$-function [27]
used in postsynaptic modeling.
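The last observation can be verified numerically. The sketch below convolves two discrete exponential kernels with the same scale and compares the result with the closed form $(k+1)\,e^{-k/\tau}$, the discrete analogue of the continuous $\alpha$-function $t\,e^{-t/\tau}$. The scale `tau = 5.0` and the kernel length are arbitrary illustrative values, not taken from the paper.

```python
import math


def exp_kernel(tau, length):
    """Discrete exponential kernel exp(-k / tau) for k = 0, ..., length - 1."""
    return [math.exp(-k / tau) for k in range(length)]


def convolve(a, b):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


tau, length = 5.0, 40
alpha = exp_kernel(tau, length)   # synaptic kernel (assumed exponential)
D = exp_kernel(tau, length)       # linear filter (assumed exponential, same scale)
eps = convolve(alpha, D)[:length]
closed_form = [(k + 1) * math.exp(-k / tau) for k in range(length)]
print(max(abs(e - c) for e, c in zip(eps, closed_form)))
```

The printed maximum deviation is at rounding-error level, so the combined kernel rises and then decays like an $\alpha$-function.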

Another issue is whether the linear summation in Eq. (7) or the multi-linear integration in
Eq. (6) contains a constant bias term $b_i$. In [28], there is such a bias term in the spike
response model. A bias term would make the later derivation easier, but we focus on the case when
there is no such bias. It is then trivial to turn to the case where a bias exists.

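To sum up, the LNP neuron model we deal with in the rest of the paper is the following, where
$\mathcal{H}$ is the index set of neurons whose activity is not bound to observations (external
stimuli).

$$ \text{For all } i \in \mathcal{H},\ t \in \mathbb{Z}^+: \quad \begin{cases} x_i^{(t)} \sim \mathrm{Bernoulli}\left(\phi_i^{(t)}\right) \\ \phi_i^{(t)} = \sigma\Big(\sum_{j \in \mathcal{M}(i)} W_{ij} \cdot \sum_{k=1}^{K} \epsilon(k) \cdot x_j^{(t-k)}\Big) \end{cases} \tag{8} $$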

The $\sigma$ function can be fit from data or from realistic neuron models, and is related to the
frequency-current curve. According to [12][17], such nonlinearities in different spiking neuron
models may be non-decreasing, close to zero towards the left, and increase almost linearly in a
certain interval. Also, since the firing rate has an upper bound, $\sigma$ will not increase
unboundedly. For this reason, $\sigma$ may be considered as a modification of the sigmoid function
to meet certain demands of inference. An example of a frequency-current curve is shown in Fig. 1.
See also [17] for another type of nonlinearity.
In Fig. 1, parameters of the leaky integrate-and-fire model are set as $\tau_m = 20$ ms,
$\Delta^{abs} = 1$ ms, $R = 1$ m$\Omega$, $\vartheta = 1$ mV (with notations following [12]).
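A frequency-current curve with the qualitative shape described above can be sketched from the textbook leaky integrate-and-fire rate formula with an absolute refractory period, $\nu = \left[\Delta^{abs} + \tau_m \ln\frac{R I}{R I - \vartheta}\right]^{-1}$ for $R I > \vartheta$ and $\nu = 0$ otherwise. It is an assumption here that this formula (with reset potential zero) is the one behind Fig. 1; the defaults below copy the parameter values from the footnote, with units chosen so that $R \cdot I$ is comparable to the threshold.

```python
import math


def lif_rate(current, tau_m=20.0, t_ref=1.0, resistance=1.0, threshold=1.0):
    """Firing rate (spikes per ms) of a leaky integrate-and-fire neuron
    driven by a constant input current, assuming reset potential zero."""
    drive = resistance * current
    if drive <= threshold:
        return 0.0  # the membrane potential never reaches threshold
    # Inter-spike interval: refractory period plus charging time to threshold.
    isi = t_ref + tau_m * math.log(drive / (drive - threshold))
    return 1.0 / isi


for current in (0.5, 1.5, 2.0, 4.0, 100.0):
    print(current, lif_rate(current))
```

The curve is zero below threshold, non-decreasing above it, and saturates below $1/\Delta^{abs}$ = 1 spike per ms, which matches the properties listed for $\sigma$.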

In the biological brain, all neurons compute in parallel. Thus we want to investigate the case
when Eq. (8) is executed in parallel: the vectors $x^{(t)} = (x_i^{(t)})_{i=1}^{n}$ and
$\phi^{(t)} = (\phi_i^{(t)})_{i=1}^{n}$ over time form a Markov chain of order $K$. The rest of the
paper focuses on showing its stochastic inference nature.
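The parallel execution of Eq. (8) can be sketched as a short simulation. This is a minimal illustration under stated assumptions: the plain sigmoid stands in for the neural nonlinearity $\sigma$, every neuron is treated as hidden, the first $K$ states are initialized at random, and the weights `W` and kernel `eps` are hypothetical values.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_lnp(W, eps, steps, seed=0, nonlinearity=sigmoid):
    """Run the discrete-time LNP network with all neurons updated in parallel.

    W[i][j] is the efficacy from presynaptic neuron j to neuron i and
    eps[k - 1] is the kernel value eps(k) for k = 1, ..., K.
    """
    rng = random.Random(seed)
    n, K = len(W), len(eps)
    # history[-k] is the spike vector x^(t-k); start from K random states.
    history = [[rng.randint(0, 1) for _ in range(n)] for _ in range(K)]
    trace = []
    for _ in range(steps):
        # Linear stage: kernel-filtered recent spikes of every neuron.
        s = [sum(eps[k - 1] * history[-k][j] for k in range(1, K + 1))
             for j in range(n)]
        # Nonlinear stage: Bernoulli probability of each neuron.
        phi = [nonlinearity(sum(W[i][j] * s[j] for j in range(n) if j != i))
               for i in range(n)]
        # Sampling stage: all neurons spike (or not) simultaneously.
        x = [1 if rng.random() < p else 0 for p in phi]
        history = history[1:] + [x]
        trace.append(x)
    return trace


W = [[0.0, 2.0, -1.0], [1.5, 0.0, 0.5], [-2.0, 1.0, 0.0]]
eps = [0.6, 0.3, 0.1]
print(simulate_lnp(W, eps, steps=5))
```

Because each new spike vector depends only on the previous $K$ vectors, the simulated sequence is a Markov chain of order $K$, as stated in the text.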



3 Semi-Stochastic Inference on Boltzmann Machines



In this section, first we compare Gibbs sampling and variational inference on Boltzmann machines, 
emphasizing their architectural similarity, then we present the semi-stochastic inference algorithm 
by combining them. Some notations we use in this section will overlap with the last section. This is 
intended since we want to link the quantities in both sections. 

Let $Y \in \{A, B\}^n$ be a collection of random variables, each with two possible values $A$ and
$B$ (e.g., $A = 0$ and $B = 1$ in the binary case). Let $\mathcal{V}, \mathcal{H}$ be a partition
of the index set $\mathcal{I} = \{1, \ldots, n\}$. We denote $Y_\mathcal{V}$ as the visible
variables and $Y_\mathcal{H}$ as the hidden variables. We use lower case $y$ to denote values,
e.g., $y_i$, $y_\mathcal{H}$ or $y$. In this paper we use Boltzmann machines with softmax
units [29].

$$ P(y) = \frac{1}{Z} \exp\Big( \sum_{i,j:\, i \ne j;\ u,v \in \{A,B\}} [y_i = u] \cdot [y_j = v] \cdot V_{iu,jv} \;-\; \sum_{i \in \mathcal{I},\, u \in \{A,B\}} [y_i = u] \cdot c_{iu} \Big) \tag{9} $$

Here $[\cdot]$ denotes the binary indicator: $[\cdot] = 1$ if the statement is true and
$[\cdot] = 0$ otherwise. $V = (V_{iu,jv})$ is a four-dimensional tensor of size $4n^2$,
$c = (c_{iu})$ is a matrix of size $2n$, and $Z$ is the normalization constant (partition
function), which depends on $V$ and $c$. Note that this family has no more capacity than the
original Boltzmann machine [30] and the Ising model [31].

In Gibbs sampling without parallelism [32][33], we are given an observed value $y_\mathcal{V}$, and
we want to find a sequence $\{y_\mathcal{H}^{(t)}\}$ which converges in distribution to the true
posterior $p(Y_\mathcal{H} \mid Y_\mathcal{V} = y_\mathcal{V})$. The algorithm proceeds as follows.
First we initialize $y_\mathcal{H}^{(0)}$ and let $y_\mathcal{V}^{(0)}$ take on the observed
values. Then at each iteration $t \in \mathbb{Z}^+$ we pick an index $i \in \mathcal{H}$ and do the
following,

(a) Calculate proposal $\phi_{iu}^{(t)} = \sigma\Big(\sum_{j \in \mathcal{M}(i),\, v \in \{A,B\}} W_{iu,jv} \cdot [y_j^{(t-1)} = v] - b_{iu}\Big)$, $\forall u \in \{A, B\}$

(b) Sample $y_i^{(t)} \sim \mathrm{Bernoulli}\left(\phi_{iA}^{(t)}, \phi_{iB}^{(t)}\right)$, pass down samples by $y_j^{(t)} = y_j^{(t-1)}$ for $j \ne i$   (10)

Here we denote $W_{iu,jv} = V_{iu,jv} - V_{i\bar{u},jv}$ and $b_{iu} = c_{iu} - c_{i\bar{u}}$, with
$\bar{u}$ defined by $\{\bar{u}\} = \{A, B\} \setminus \{u\}$. $\sigma(x) = 1 / (1 + \exp(-x))$ is
the sigmoid function. $\mathcal{M}(i)$ is the index set of the Markov blanket of $Y_i$, i.e.,
$\mathcal{M}(i) = \{j \in \mathcal{I} \mid j \ne i,\ \exists u, v \in \{A, B\}\ \text{s.t.}\ V_{iu,jv} \ne 0\}$.
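One Gibbs update of Eq. (10) can be sketched as follows, assuming the proposal has the form $\sigma(\sum W_{iu,jv}[y_j = v] - b_{iu})$. The helper `antisymmetric_model` is hypothetical: it builds $W$ and $b$ for ordinary binary units from couplings `J` and fields `h`, so that $W_{i\bar{u},jv} = -W_{iu,jv}$ and $b_{i\bar{u}} = -b_{iu}$, which makes the two proposals sum to one.

```python
import math
import random

A, B = 0, 1  # the two states of every variable


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def antisymmetric_model(J, h):
    """Hypothetical helper: W[i][u][j][v] and b[i][u] for binary units whose
    log-odds of state B is sum_j J[i][j] * [y_j = B] + h[i]."""
    n = len(h)
    W = [[[[0.0, 0.0] for _ in range(n)] for _ in (A, B)] for _ in range(n)]
    b = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        b[i][B], b[i][A] = -h[i], h[i]
        for j in range(n):
            W[i][B][j][B] = J[i][j]
            W[i][A][j][B] = -J[i][j]
    return W, b


def gibbs_step(y, i, W, b, rng):
    """Steps (a) and (b) for variable i; returns the new state and proposals."""
    n = len(y)
    phi = []
    for u in (A, B):
        # sum_{j,v} W_{iu,jv} * [y_j = v] selects the entry with v = y_j.
        z = sum(W[i][u][j][y[j]] for j in range(n) if j != i) - b[i][u]
        phi.append(sigmoid(z))
    new_y = list(y)  # all other variables are passed down unchanged
    new_y[i] = A if rng.random() < phi[A] else B
    return new_y, phi


J = [[0.0, 1.2, -0.7], [1.2, 0.0, 0.4], [-0.7, 0.4, 0.0]]
h = [0.1, -0.3, 0.2]
W, b = antisymmetric_model(J, h)
print(gibbs_step([B, A, B], 1, W, b, random.Random(0)))
```

Sweeping `gibbs_step` over the hidden indices, one at a time, gives the sequential sampler described in the text.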

Variational inference with a factorized approximation family often takes the form of a mean-field
version of Gibbs sampling (e.g., [34]). We adopt the same family of variational distributions as
in [35], such that $q(Y_\mathcal{H} \mid \Theta_\mathcal{H}) = \prod_{i \in \mathcal{H}} q(Y_i \mid \theta_{iA}, \theta_{iB})$,
where each $q(Y_i \mid \theta_{iA}, \theta_{iB})$ is a Bernoulli distribution such that
$\theta_{iu} = q(Y_i = u \mid \theta_{iA}, \theta_{iB})$ for $u \in \{A, B\}$.* The objective of
variational inference is to find $\Theta_\mathcal{H}$ by minimizing the loss function
$L(\Theta_\mathcal{H}) = \mathrm{KL}\left(q(Y \mid \Theta) \,\|\, p(Y)\right)$, where
$\mathrm{KL}(q \,\|\, p) = E_q \log(q/p)$ is the KL-divergence.

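The algorithm proceeds as follows: first we randomly initialize $\Theta_\mathcal{H}^{(0)}$, then at
each iteration $t \in \mathbb{Z}^+$ we pick an index $i \in \mathcal{H}$ and do the following,

(a) Calculate update $\phi_{iu}^{(t)} = \sigma\Big(\sum_{j \in \mathcal{M}(i),\, v \in \{A,B\}} W_{iu,jv} \cdot \theta_{jv}^{(t-1)} - b_{iu}\Big)$, $\forall u \in \{A, B\}$

(b) Assign $\theta_{iu}^{(t)} = \phi_{iu}^{(t)}$, pass down $\theta_{ju}^{(t)} = \theta_{ju}^{(t-1)}$ for $j \ne i$, $\forall u \in \{A, B\}$.   (11)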
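The mean-field iteration of Eq. (11) can be sketched for ordinary binary units. As a simplifying assumption, each unit is described by the single number $\theta_i = \theta_{iB}$, and the pre-activation is written as $\sum_j J_{ij}\theta_j + h_i$, which corresponds to $W_{iB,jB} = J_{ij}$, $W_{iB,jA} = 0$ and $b_{iB} = -h_i$. The couplings, fields and number of sweeps are illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_field(J, h, theta, hidden, sweeps=50):
    """Sequential mean-field updates; theta[i] is q(Y_i = B).

    Entries of theta for visible units stay fixed at their observed values.
    """
    theta = list(theta)
    n = len(theta)
    for _ in range(sweeps):
        for i in hidden:
            # (a) deterministic update from the current means of the neighbors
            z = sum(J[i][j] * theta[j] for j in range(n) if j != i) + h[i]
            # (b) assign immediately, so later updates see the new value
            theta[i] = sigmoid(z)
    return theta


J = [[0.0, 0.5, 0.3], [0.5, 0.0, -0.4], [0.3, -0.4, 0.0]]
h = [0.0, 0.2, -0.1]
theta = mean_field(J, h, theta=[1.0, 0.5, 0.5], hidden=[1, 2])
print(theta)
```

The only difference from the Gibbs step is that the indicator $[y_j = v]$ is replaced by its mean $\theta_{jv}$ and no sampling takes place.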

*For simplicity, when $i \in \mathcal{V}$ we write $\theta_{iu}$ and
$q(Y_i \mid \theta_{iA}, \theta_{iB})$ as well, except that $\theta_{iu}$ for all iterations is
fixed as the observed value. We denote $\Theta = (\theta_{iu})_{i \in \mathcal{I},\, u \in \{A,B\}}$
and write $q(Y \mid \Theta) = \prod_{i \in \mathcal{I}} q(Y_i \mid \theta_{iA}, \theta_{iB})$.

Our motivation for combining them is as follows. First, exponential family distributions have
sufficient statistics [36]. By accumulating the empirical expectation of the sufficient statistics,
we can estimate the parameters online if data instances are completely observed. For example, to
estimate a Gaussian distribution $\mathcal{N}(X \mid \mu, \sigma)$ online, we can incrementally
calculate $\sum_{i=1}^{t} x_i$ and $\sum_{i=1}^{t} x_i^2$, and estimate $\hat{\mu}$ and
$\hat{\sigma}$ at each time $t$ by using only these two statistics.
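The Gaussian example can be written out in a few lines. The class name `RunningGaussian` is a hypothetical choice, and the estimate used is the maximum-likelihood one, $\hat{\mu} = \frac{1}{t}\sum x_i$ and $\hat{\sigma}^2 = \frac{1}{t}\sum x_i^2 - \hat{\mu}^2$.

```python
import math


class RunningGaussian:
    """Online Gaussian estimate that stores only the sufficient statistics."""

    def __init__(self):
        self.count = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def update(self, x):
        self.count += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def estimate(self):
        """Return (mean, standard deviation) from the accumulated sums."""
        mean = self.sum_x / self.count
        variance = self.sum_x2 / self.count - mean * mean
        return mean, math.sqrt(max(variance, 0.0))  # guard against rounding


est = RunningGaussian()
for x in [1.0, 2.0, 3.0, 4.0]:
    est.update(x)
    print(est.estimate())
```

No past data instance needs to be stored, which is what makes the same idea usable for the online variational estimate discussed next.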

Now if we take the samples from Gibbs sampling as the online data instances and estimate a
distribution from within the variational approximation family, we end up with a marginal
probability distribution for each random variable (when mixing is good). However, we can also
change the way the sequence is generated, such that the online estimated variational distribution
is biased towards whatever we want. In particular, we can take the distribution calculated in
Eq. (11)(a) as the proposal distribution, use it to sample a new data example, and incrementally
update the variational distribution with that example. The updated variational distribution serves
as the next input to Eq. (11). If decaying as in [37] is used when accumulating sufficient
statistics, the variational distribution estimated in our Bernoulli case is simply a weighted sum
of the most recent examples:



$$ \theta_{iu}^{(t)} = \sum_{t-k=-\infty}^{t} \epsilon(k) \cdot [y_i^{(t-k)} = u], \quad \text{for } u \in \{A, B\} \tag{12} $$



The $-\infty$ in Eq. (12) simply means the starting time of accumulation. The $\epsilon$ function
is non-negative and should sum to one over all $k$ in the effective time span. If a constant decay
ratio is applied to the last accumulated sufficient statistics as in [37], the $\epsilon$ function
above decays exponentially. Some choices of the $\epsilon$ function may not have any online
updating scheme that produces them. In practice, we can either do online updating or let
$\epsilon$ have a finite time course, up to some $K \in \mathbb{Z}^+$.
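The link between a constant decay ratio and an exponentially decaying $\epsilon$ can be illustrated with one simple form of decay, $\theta^{(t)} = (1-\gamma)\,\theta^{(t-1)} + \gamma\,[y^{(t)} = u]$. This recursion is an assumption made for illustration and is not necessarily the exact scheme of [37]. Unrolling it gives the weights $\epsilon(k) = \gamma(1-\gamma)^k$.

```python
def online_estimate(samples, gamma):
    """Recursive update with constant decay, starting from zero accumulation."""
    theta = 0.0
    for y in samples:
        theta = (1.0 - gamma) * theta + gamma * y
    return theta


def weighted_sum_estimate(samples, gamma):
    """Explicit weighted sum with eps(k) = gamma * (1 - gamma)**k, where k = 0
    is the most recent sample."""
    t = len(samples)
    return sum(gamma * (1.0 - gamma) ** k * samples[t - 1 - k] for k in range(t))


samples = [1, 0, 1, 1, 0, 1]  # indicator values [y = u] over time
gamma = 0.3
print(online_estimate(samples, gamma), weighted_sum_estimate(samples, gamma))
```

With only $t$ samples accumulated, the weights sum to $1 - (1-\gamma)^t$ rather than to one, which is why a normalization is needed early in the run.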

In semi-stochastic inference, we choose a non-negative weight function $\epsilon(k)$ which sums to
1 over $k \in \{1, 2, \ldots, K\}$. We initialize both $y^{(0)}$ and $\Theta^{(0)}$. In each
iteration $t \in \mathbb{Z}^+$ we pick an index $i \in \mathcal{H}$ and do the following.



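(a) $\phi_{iu}^{(t)} = \sigma\Big(\sum_{j \in \mathcal{M}(i),\, v \in \{A,B\}} W_{iu,jv} \cdot \theta_{jv}^{(t-1)} - b_{iu}\Big)$, for $u \in \{A, B\}$

(b) $y_i^{(t)} \sim \mathrm{Bernoulli}\left(\phi_{iA}^{(t)}, \phi_{iB}^{(t)}\right)$

(c) $\theta_{iu}^{(t)} = \sum_{k=1}^{K} \epsilon(k) \cdot [y_i^{(t-k+1)} = u]$, for $u \in \{A, B\}$   (13)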



In sequential inference, for every other random variable which does not carry out these updating
steps, we pass down $\theta$ and the most recent $K$ copies of the $y$'s to the next time step. In
step (c), additional normalization is needed when $t < K$. Details are omitted for clarity of
presentation.
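The three steps of Eq. (13) can be sketched for ordinary binary units. Several simplifications are assumptions of this sketch rather than the paper's exact procedure: each unit is described by $\theta_i = \theta_{iB}$ with pre-activation $\sum_j J_{ij}\theta_j + h_i$, all hidden units are updated in synchronized parallel, the weights follow the normalized $\exp(-k/2)$ form used in the experiments, and step (c) is renormalized while fewer than $K$ samples exist.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def eps_weights(K, scale=2.0):
    """eps(k) proportional to exp(-k / scale), normalized over k = 1, ..., K."""
    raw = [math.exp(-k / scale) for k in range(1, K + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def semi_stochastic(J, h, hidden, theta0, eps, steps, seed=0):
    """Return the trajectory of theta; visible entries of theta0 stay fixed."""
    rng = random.Random(seed)
    n, K = len(h), len(eps)
    theta = list(theta0)
    history = {i: [] for i in hidden}  # recent samples, most recent first
    trace = []
    for _ in range(steps):
        # (a) proposals computed from the previous variational parameters
        phi = {i: sigmoid(sum(J[i][j] * theta[j] for j in range(n) if j != i) + h[i])
               for i in hidden}
        for i in hidden:
            # (b) sample a new example from the proposal
            y = 1 if rng.random() < phi[i] else 0
            history[i] = ([y] + history[i])[:K]
            # (c) weighted average of recent samples (renormalized when t < K)
            w = eps[:len(history[i])]
            theta[i] = sum(wk * yk for wk, yk in zip(w, history[i])) / sum(w)
        trace.append(list(theta))
    return trace


J = [[0.0, 1.0, 0.0], [1.0, 0.0, -1.0], [0.0, -1.0, 0.0]]
h = [0.0, 0.5, 0.0]
trace = semi_stochastic(J, h, hidden=[1, 2], theta0=[1.0, 0.5, 0.5],
                        eps=eps_weights(30), steps=100)
print(trace[-1])
```

Compared with the mean-field update, the proposal $\phi$ is never assigned to $\theta$ directly; it only enters through the samples, which is the "random slowing-down" behaviour described next.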

Alternatively, we can look at the algorithm in Eq. (13) as a "random slowing-down" / stochastic
approximation / momentum version of variational inference. When an updated variational
distribution is computed, we do not immediately switch to the updated distribution. Instead, we
use it to sample a new example, and use that example to incrementally update the original
variational distribution. The expectation of the new example is the same as the updated
variational distribution.



4 Making Detailed Correspondence 

To connect the LNP neuron model in Eq. (8) with the semi-stochastic inference algorithm in
Eq. (13), several issues need to be resolved:

1. LNP neurons don't have bias terms (such as $b_{iu}$), although their nonlinear transform
function may have a constant shift identical for all neurons.

2. In Eq. (13), the two Linear-Nonlinear procedures in (a) and (c) correspond to only one sampling
step in (b). This is an architectural difference.

3. Biological neurons can only be excitatory or inhibitory. 

4. Biological neurons carry out the updates all in parallel. 

To resolve issue 1, we notice that when the weights $V$ and biases $c$ in Eq. (9) are turned into
$W$ and $b$ in Eq. (13), for fixed $i, j \in \mathcal{I}$, fixed $u, v \in \{A, B\}$, and any
constant $C \in \mathbb{R}$, the following operation leaves the inference algorithm in Eq. (13)
invariant, since $\theta_{jA}$ and $\theta_{jB}$ sum up to one.

$$ W_{iu,jv} \leftarrow W_{iu,jv} + C, \qquad W_{iu,j\bar{v}} \leftarrow W_{iu,j\bar{v}} + C, \qquad b_{iu} \leftarrow b_{iu} + C $$

For this reason, we can always modify the algorithm in Eq. (13) to obtain any bias we want. In our
experiments, when we want to remove a bias $b_{iu}$, we subtract $b_{iu} / |\mathcal{M}(i)|$ from
both $W_{iu,jA}$ and $W_{iu,jB}$ for each $j \in \mathcal{M}(i)$.
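The bias-removal step can be checked numerically, assuming the pre-activation has the form $\sum_{j,v} W_{iu,jv}\,\theta_{jv} - b_{iu}$. In the sketch, `W_iu` holds the pairs $(W_{iu,jA}, W_{iu,jB})$ for the neighbors $j \in \mathcal{M}(i)$ of one unit and one state; the function names and numbers are hypothetical.

```python
def pre_activation(W_iu, b_iu, theta):
    """sum_{j,v} W_{iu,jv} * theta_{jv} - b_{iu} for one unit i and one state u."""
    return sum(W_iu[j][v] * theta[j][v]
               for j in range(len(W_iu)) for v in (0, 1)) - b_iu


def absorb_bias(W_iu, b_iu):
    """Subtract b_iu / |M(i)| from both weights of every neighbor; bias becomes 0."""
    m = len(W_iu)
    W_new = [[w - b_iu / m for w in pair] for pair in W_iu]
    return W_new, 0.0


W_iu = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.0]]    # three neighbors of unit i
b_iu = 0.9
theta = [[0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]     # each pair sums to one
before = pre_activation(W_iu, b_iu, theta)
W_new, b_new = absorb_bias(W_iu, b_iu)
after = pre_activation(W_new, b_new, theta)
print(before, after)
```

The two values agree for any $\theta$ whose pairs sum to one, which is the condition the invariance argument depends on.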

To resolve issue 2, we propose the "Event-Network". We first split every sampling step (b) in
Eq. (13) into two, namely $[y_i^{(t)} = A] \sim \mathrm{Bernoulli}(\phi_{iA}^{(t)})$ and
$[y_i^{(t)} = B] \sim \mathrm{Bernoulli}(\phi_{iB}^{(t)})$. Now we can allocate two neurons to take
care of the two parts in all three steps (a), (b) and (c). Units of the resulting network no longer
correspond to random variables; they are now probabilistic events. When we do this, there is no
guarantee that $\theta_{iA}$ and $\theta_{iB}$ will sum up to 1, but this will be approximately
true since the weights $W_{iu,jv}$ are compatible with each other. Whether this results in
meaningful inference algorithms needs further investigation. We show verification in our
experiment part.

To resolve issue 3, once we have a network containing neurons with both excitatory and inhibitory
outgoing synapses, we can duplicate each neuron into two, both having the same incoming synapses
with the original efficacies. Then one of them takes all the positive outgoing synapses of the
original neuron, and the other takes all the negative ones. Both this and the previous modification
of the network are exactly invariant for variational inference but may only be approximately
invariant for semi-stochastic inference. We show experimental verification for this modification as
well.
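The exact invariance under deterministic (variational-style) updates can be illustrated with a small check. The sketch assumes a plain weight matrix `W[i][j]` (efficacy from neuron `j` to neuron `i`), a sigmoid nonlinearity, and synchronized parallel updates; the function names and the indexing convention (copy `j` keeps the positive outgoing synapses, copy `n + j` the negative ones) are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def split_signs(W):
    """Duplicate every neuron into an excitatory and an inhibitory copy.

    Both copies of neuron i receive the same incoming synapses.
    """
    n = len(W)
    W2 = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            for target in (i, n + i):
                W2[target][j] = max(W[i][j], 0.0)      # from excitatory copy of j
                W2[target][n + j] = min(W[i][j], 0.0)  # from inhibitory copy of j
    return W2


def parallel_update(W, theta):
    """One synchronized deterministic update theta <- sigmoid(W theta)."""
    return [sigmoid(sum(w * t for w, t in zip(row, theta))) for row in W]


W = [[0.0, 1.5, -2.0], [-0.8, 0.0, 0.6], [1.1, -0.4, 0.0]]
theta = [0.2, 0.9, 0.5]
theta2 = theta + theta  # both copies start from the same activation
W2 = split_signs(W)
for _ in range(5):
    theta = parallel_update(W, theta)
    theta2 = parallel_update(W2, theta2)
print(theta, theta2)
```

Both copies of each neuron track the original activation. Under sampling, the two copies draw independent spikes, so the equality can only be expected to hold approximately.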

If, after resolving all these issues, we denote $x_{iu}^{(t)} = [y_i^{(t)} = u]$ for
$u \in \{A, B\}$ and re-index everything, the resulting algorithm is exactly the same as Eq. (8)
when executed in parallel. Parallelism is not only an issue in semi-stochastic inference. It is
noticed in [33] that when adjacent random variables of a Boltzmann machine are updated in parallel
in Gibbs sampling, the sample sequence does not converge to the right distribution. This issue has
also been raised often recently (e.g. [38]). Variational inference is a fixed point iteration, so
it may converge to the same answer when all variables are updated in parallel, but we do observe
experimentally that the variational parameters converge to the fixed point with great oscillation.
Whether the slowing-down version alleviates this problem requires further investigation.^ In fact,
if we turn variational inference into a continuous time dynamical system and simulate it with finer
time steps, oscillations can be alleviated in a similar way as by applying momentum.



5 Experimental Verification 



Figure 2: Samples of reconstruction results (selected randomly). Rows from top to bottom: original
images; VarO; SemiO; SemiEN; SemiB; SemiU.

In this section, we focus on illustrating that semi-stochastic inference, as well as all the
modifications in Section 4, are valid Bayesian inference algorithms that behave similarly to
variational inference. In all experiments, we set $\epsilon(k)$ as $\exp(-k/2)$, normalized to sum
to 1 over its domain $k \in \{1, \ldots, 30\}$. All inferences are carried out in synchronized full
parallelism.



^We observe that different choices of the $\epsilon$ function result in different smoothness of
the trajectories, but it is not clear how. We leave this as future work.
Figure 3: Sample trajectories of $\theta_{iB}^{(t)}$ in VarO and SemiO. In all plots, the blue line
is for VarO, the green line is for SemiO, and the red line is the moving average of the green line
over a window of size 30.



To avoid the complication of learning deep Boltzmann machines, we take the learned Boltzmann
machine from the code of [35]. Since we focus on inference in Boltzmann machines, we take the
learned model before back-propagation. The Boltzmann machine we use consists of three layers: the
first layer, for the input image, consists of 784 units, the second consists of 500 units and the
third consists of 1000 units. In the following, we refer to $\theta_{iB}^{(t)}$ in both algorithms
as the activation of neuron $i$.

To see the effect of the modifications in Section 4, we show results of semi-stochastic inference
after each of them. Note that variational inference is exactly invariant under all these
modifications, hence there is no need to verify them. We use the following abbreviations in this
part:

• VarO: variational inference applied on the original deep Boltzmann machine. 

• SemiO: semi-stochastic inference applied on the original deep Boltzmann machine. 

• SemiEN: Same as SemiO except each sampling is duplicated and put on two neurons. 

• SemiB: Same as SemiEN except biases are removed by transforming the network. 

• SemiU: Same as SemiB except each neuron is duplicated into two, such that no neuron 
contains out-going synapses with different signs. 

The experiment we did is similar to the reconstruction experiment in [?]. We do the following: (i)
plug in an image on the input layer, (ii) apply the inference algorithm, (iii) freeze the
activations on the topmost layer and round them to 0 or 1, (iv) set the topmost layer as observed
and the other layers as hidden, (v) infer back, (vi) read off the image as the activations on the
input layer. Both (ii) and (v) are randomly initialized and run for 100 iterations. For
semi-stochastic inference, in steps (iii) and (vi) the activations are taken as averages over the
last 50 iterations, while for variational inference we simply use the last activation of each
neuron. This is because semi-stochastic inference has random convergence: activations approach some
fixed point with random perturbation (see the justification later). Some reconstruction examples
are shown in Fig. 2. Each algorithm has good and bad cases. Also note that, because of the
randomness, even for an identical input image and identical initialization, semi-stochastic
inference may get different results in different trials.

In Figure 3 we show sample trajectories of activations in step (ii) of VarO and SemiO for the same
variables. Each trajectory consists of the $\theta_{iB}^{(t)}$ values of one neuron over all
iterations. We see that activations in variational inference converge to a stable solution, with
oscillation before convergence, while in semi-stochastic inference the moving average converges to
similar values with random perturbation. Figure 5(a) is a scatter plot of the mean of each
trajectory vs. the corresponding converged activation in variational inference, extracted from
2000 trajectories randomly chosen from 50 trials. We also observe that when the mean is close to
the extreme values (0 and 1), the standard deviation is smaller, as can be seen in Figure 3 and
more clearly in Figure 5(b). This means that semi-stochastic inference converges randomly to the
variational inference solution, with more confident variables having less random perturbation.

Finally, we investigate how well the splitting in Section 4 preserves the identity relationships
that are expected in SemiU. In particular, if two neurons result from splitting one sampling into
two in resolving issue 2, their summed activation should be close to 1. If two neurons result from
splitting one neuron into two in resolving issue 3, their activations should be equal. Figure 4
illustrates three samples, and histograms over 2000 random trajectories are given in
Figure 5(c)(d).
Figure 4: Sample trajectories from SemiU. Each plot shows four neurons resulting from two splits
of a single neuron. Thick dashed lines: neurons with positive outgoing weights. Thin solid lines:
corresponding neurons with negative outgoing weights. Blue and red lines are for the two neurons
from splitting one sampling into two. The green line is the average of blue and red. Ideally, lines
with the same color should be equal, and green lines should be at 0.5.




Figure 5: Some statistics of the algorithms. (a) Means of trajectories in SemiO vs. converged
activation values in VarO for the same hidden variables. (b) Standard deviation vs. mean, from the
trajectories in SemiO. (c) Histogram of standard_deviation(red line + blue line - 1.0). (d)
Histogram of standard_deviation(dash line - solid line). (Here red, blue, dash and solid lines
refer to Figure 4.)

6 Conclusions and Discussions 

We pointed out the stochastic inference nature of the LNP neuron models, which may serve as an 
interpretation of neural coding. The stochastic convergence seems reasonable, e.g., when we see 
ambiguous images, our perception will switch back and forth between plausible explanations. 

There are many behavioral experiments showing statistical optimality of perception and learning
(e.g. [39]) and many calls for neuronal modeling that achieves this optimality (e.g. [40]).
Arguably, the mode-seeking nature of variational inference [41] makes it natural for interpreting
perception. Since Boltzmann machines with hidden variables are compact universal approximators
(e.g. [42]), modeling knowledge representation with them is safe if the world can be approximately
binarized. Yet we only show that, with particular choices of weights, neurons are capable of
representing a Boltzmann machine and carrying out inference. The question is then what the network
does if it is a general recurrent neural network. Real biological neural networks may be learned by
taking the Boltzmann machine representation and undergoing discriminative training or reinforcement
learning to optimize the performance of inference directly (especially plausible when the time
dimension is involved), which yields a model that does not necessarily convey consistent
probabilistic semantics but may do so in the limit. The particular form of nonlinearity in
biological neurons may also result from such optimization. Another issue is the low average
activity in biological neurons. It may be a certain form of sparse coding [4] on top of inference,
so that optimizing such an inference procedure will not yield degenerate solutions. In this case,
likelihood-based learning such as contrastive divergence may still be applicable as a further
refinement.